kernel: clarify and assert interrupts aren't disabled upon context switch #93440


Open: wants to merge 3 commits into main

Conversation

mathieuchopstm (Contributor):

The do_swap() routine used when CONFIG_USE_SWITCH=y asserts that the caller thread does not hold any spinlock when CONFIG_SPIN_VALIDATE is enabled. However, there is no similar check in place when CONFIG_USE_SWITCH=n.

Copy this assertion into the USE_SWITCH=n implementation of z_swap_irqlock().


Signed-off-by: Mathieu Choplain <mathieu.choplain@st.com>
andyross previously approved these changes Jul 21, 2025

@andyross (Contributor) left a comment:

For clarity: this is useful for architectures like arm32 that have an arch_swap() and miss the check elsewhere. The condition being detected is always a bug.

krish2718 previously approved these changes Jul 21, 2025

@krish2718 (Contributor) left a comment:

Thanks, this caught the bug in my case where a thread holding a spinlock tries to sleep (on a mutex).

peter-mitsis
peter-mitsis previously approved these changes Jul 21, 2025
TaiJuWu
TaiJuWu previously approved these changes Jul 22, 2025
@mathieuchopstm (Contributor, Author):

For the record, this patch breaks part of the irq_lock() specification when CONFIG_SPIN_VALIDATE=y since there's no way to distinguish between k_spin_lock() and irq_lock():

zephyr/include/zephyr/irq.h

Lines 242 to 251 in 1f69b91

* @note
* This routine can be called by ISRs or by threads. If it is called by a
* thread, the interrupt lock is thread-specific; this means that interrupts
* remain disabled only while the thread is running. If the thread performs an
* operation that allows another thread to run (for example, giving a semaphore
* or sleeping for N milliseconds), the interrupt lock no longer applies and
* interrupts may be re-enabled while other processing occurs. When the thread
* once again becomes the current thread, the kernel re-establishes its
* interrupt lock; this ensures the thread won't be interrupted until it has
* explicitly released the interrupt lock it established.

Using QEMU, I verified that the existing assertion (when CONFIG_USE_SWITCH=y, from bd07756) does panic if the irq_lock() is held upon context switch:

diff --git a/samples/hello_world/src/main.c b/samples/hello_world/src/main.c
index c550ab461cb..be87d15501c 100644
--- a/samples/hello_world/src/main.c
+++ b/samples/hello_world/src/main.c
@@ -5,10 +5,13 @@
  */
 
 #include <stdio.h>
+#include <zephyr/kernel.h>
 
 int main(void)
 {
+       irq_lock();
        printf("Hello World! %s\n", CONFIG_BOARD_TARGET);
+       k_msleep(1000);
 
        return 0;
 }
$ # with upstream
$ west build -b qemu_x86_64 samples/hello_world/ -DCONFIG_USE_SWITCH=y -DCONFIG_ASSERT=y -DCONFIG_SPIN_VALIDATE=y -p
$ west build -t run
-- west build: running target run
[3/4] To exit from QEMU enter: 'CTRL+a, x'
[QEMU] CPU: qemu64,+x2apic
qemu-system-x86_64: warning: TCG doesn't support requested feature: CPUID.01H:ECX.x2apic [bit 21]
qemu-system-x86_64: warning: TCG doesn't support requested feature: CPUID.01H:ECX.x2apic [bit 21]
SeaBIOS (version zephyr-v1.0.0-0-g31d4e0e-dirty-20200714_234759-fv-az50-zephyr)
Booting from ROM..
*** Booting Zephyr OS build v4.2.0-334-g124fb897b490 ***
Hello World! qemu_x86_64/atom
ASSERTION FAIL [arch_irq_unlocked(key) || z_smp_current_get()->base.thread_state & (((1UL << (0))) | ((1UL << (3))))] @ WEST_TOPDIR/zephyr/kernel/include/kswap.h:98
        Context switching while holding lock!
RAX: 0x0000000000000004 RBX: 0x000000000011fd90 RCX: 0x0000000000000001 RDX: 0x0000000000000000
RSI: 0x0000000000000062 RDI: 0x00000000001090d2 RBP: 0x000000000012bd70 RSP: 0x000000000012bd48
 R8: 0x0000000000000000  R9: 0x0000000000000000 R10: 0x0000000000000002 R11: 0x0000000000000000
R12: 0x000000000010b500 R13: 0x0000000000000000 R14: 0x0000000000000000 R15: 0x000000000012be48
RSP: 0x000000000012bd48 RFLAGS: 0x0000000000000002 CS: 0x0018 CR3: 0x0000000000136000
RIP: 0x000000000010046f
call trace:
     0: 0x00000000001051d5
     1: 0x000000000010524d
     2: 0x0000000000100025
     3: 0x0000000000102f2d
     4: 0x000000000010045a

On a related note, the k_spin_lock() documentation does not mention whether holding a k_spinlock is allowed upon reschedule:

/**
* @brief Lock a spinlock
*
* This routine locks the specified spinlock, returning a key handle
* representing interrupt state needed at unlock time. Upon
* returning, the calling thread is guaranteed not to be suspended or
* interrupted on its current CPU until it calls k_spin_unlock(). The
* implementation guarantees mutual exclusion: exactly one thread on
* one CPU will return from k_spin_lock() at a time. Other CPUs
* trying to acquire a lock already held by another CPU will enter an
* implementation-defined busy loop ("spinning") until the lock is
* released.
*
* Separate spin locks may be nested. It is legal to lock an
* (unlocked) spin lock while holding a different lock. Spin locks
* are not recursive, however: an attempt to acquire a spin lock that
* the CPU already holds will deadlock.
*
* In circumstances where only one CPU exists, the behavior of
* k_spin_lock() remains as specified above, though obviously no
* spinning will take place. Implementations may be free to optimize
* in uniprocessor contexts such that the locking reduces to an
* interrupt mask operation.
*
* @param l A pointer to the spinlock to lock
* @return A key value that must be passed to k_spin_unlock() when the
* lock is released.
*/

The options I see are:

  1. don't merge this patch ( 😢 )
  2. merge the patch as-is
  • Holders of irq_lock on context switch will inexplicably panic if CONFIG_SPIN_VALIDATE=y
  • ...and it should be remarked that CONFIG_SPIN_VALIDATE usually defaults to y if CONFIG_ASSERT=y
  3. update the documentation of irq_lock() to indicate that holding irq_lock on context switch will panic when CONFIG_SPIN_VALIDATE=y
  4. update the documentation of irq_lock() to no longer allow holding irq_lock on context switch (breaking change?)

@andyross (Contributor):

My vote would just be to remove that section of the docs and rip the bandaid off with the warning (which can always be disabled via kconfig anyway). There are precisely zero circumstances where breaking a critical section (!) lock with a context switch (!!) cannot lead to tears.

Code that does that is not only breaking its own carefully curated lock (in an immensely clever way that it clearly thought carefully about and knows has no unexpected interactions); because of the recursive behavior of irq_lock(), it's also breaking the locks taken by all the callers up the stack, who had no idea that calling into this obnoxiously clever subsystem was a synchronization trap.

It just can't work. We never should have documented it (and I didn't even know we had!).

@mathieuchopstm (Contributor, Author):

mathieuchopstm commented Jul 23, 2025

What do you think of replacing the existing note in zephyr/include/zephyr/irq.h (Lines 242 to 251 in 47b07e5, quoted above) with the following?

 * @note
 * This routine can be called by ISRs or by threads.
 * Only isr-ok functions may be called while holding the interrupt lock.
 * It is a fatal error to hold the interrupt lock upon return from ISR.

These statements sound correct from my understanding of the kernel:

  • isr-ok functions are the only ones guaranteed to not context switch
  • Kernel might reschedule upon return-from-ISR

We could add a more direct "It is a fatal error to hold the interrupt lock upon context switch", but I feel like these statements are equivalent, while being easier to understand for people not necessarily familiar with kernel terminology.

If this is fine for you, I'll add a new commit that edits irq.h.

@krish2718 (Contributor):

It is a fatal error to hold the interrupt lock upon context switch

How about the infamous Linux wording BUG: Scheduling while atomic?

@mathieuchopstm (Contributor, Author):

It is a fatal error to hold the interrupt lock upon context switch

How about the infamous Linux wording BUG: Scheduling while atomic?

We already have a runtime message 🙂

__ASSERT(arch_irq_unlocked(key) ||
	 _current->base.thread_state & (_THREAD_DUMMY | _THREAD_DEAD),
	 "Context switching while holding lock!");

What needs updating is the documentation for irq_lock() as it isn't consistent with the rest of the kernel.

@bjarki-andreasen (Contributor):

I think we need the explicit wording of "It is a fatal error to hold the interrupt lock upon context switch", since a thread can force a reschedule/context switch at any time by calling k_yield() for example.

@mathieuchopstm (Contributor, Author):

I think we need the explicit wording of "It is a fatal error to hold the interrupt lock upon context switch", since a thread can force a reschedule/context switch at any time by calling k_yield() for example.

I don't think k_yield() is isr-ok though?

(Really, my "Only isr-ok functions may be called while holding the interrupt lock" is a non-kernel wording of "It is a fatal error to hold the interrupt lock upon context switch" - I don't mind adding the latter anyway if you feel it's better though)

@bjarki-andreasen (Contributor):

bjarki-andreasen commented Jul 23, 2025

I think we need the explicit wording of "It is a fatal error to hold the interrupt lock upon context switch", since a thread can force a reschedule/context switch at any time by calling k_yield() for example.

I don't think k_yield() is isr-ok though?

(Really, my Only isr-ok functions may be called while holding the interrupt lock is a non-kernel-wording way of saying It is a fatal error to hold the interrupt lock upon context switch - I don't mind adding it anyways if you feel like it's better though)

I disagree with the way the note is worded: reading it, I see the three lines as one strangely phrased paragraph and find myself auto-correcting it in my head to make it make sense. Could you either split it into three separate @note entries, or phrase it as a single paragraph?

 * @note This routine can be called by ISRs or by threads.
 * @note Only isr-ok functions may be called while holding the interrupt lock.
 * @note It is a fatal error to hold the interrupt lock upon return from ISR.

or

* @note This function can be called by ISRs and threads. After calling this
* routine, the caller is "holding the interrupt lock". While holding the interrupt lock,
* only isr-ok functions may be called, as these don't result in context switches.
* While holding the interrupt lock, a context switch, which could result from calling
* a non isr-ok function, or returning from an ISR, is a fatal error.

@andyross (Contributor):

Bikeshed aside, I think adding a warning in the docs is fine, but IMHO it should be as short and clear as possible. "Context switching while holding the irq_lock is illegal." is fine, etc... The user will be alerted by the assert if they trip over it. (Also nitpick: it's not a "fatal error" as nothing actually fails or panics in an obvious way, which is why this is a booby trap.)

Upon context switch, the virtual "interrupt lock" acquired by irq_lock()
must not be "held". However, the current documentation for irq_lock() says
that it is perfectly valid to hold it (!), and that a suspended thread will
hold the "interrupt lock" upon being scheduled again (!!).

Update the documentation to remove the outdated section and indicate that
context switching while holding the interrupt lock is not allowed.

Signed-off-by: Mathieu Choplain <mathieu.choplain@st.com>
Threads must not attempt to context switch if they are holding a spinlock.

Add this information to the documentation for k_spin_lock().

Signed-off-by: Mathieu Choplain <mathieu.choplain@st.com>
@mathieuchopstm mathieuchopstm dismissed stale reviews from TaiJuWu and bjarki-andreasen via a173881 July 23, 2025 14:23
@mathieuchopstm mathieuchopstm dismissed stale reviews from peter-mitsis, krish2718, and andyross via a173881 July 23, 2025 14:23
@mathieuchopstm (Contributor, Author):

Modified the irq_lock() (and k_spin_lock() while at it) documentation with short comments.

@andyross (Contributor) left a comment:

Looks clear to me.

@mathieuchopstm changed the title from "kernel: assert no spinlock is held on swap when !USE_SWITCH" to "kernel: clarify and assert interrupts aren't disabled upon context switch" on Jul 23, 2025
@bjarki-andreasen (Contributor) left a comment:

Seems perfect
